Homework - Reinforcement Learning - Part B (60/100 points)

by Todd Gureckis and Brenden Lake
Computational Cognitive Modeling
NYU class webpage: https://brendenlake.github.io/CCM-site/
email to course instructors: instructors-ccm-spring2019@nyuccl.org


Learning and deciding in an unknown world

Part A of the homework explored various solutions to a sequential decision-making problem in a case where considerable information about the environment was known or provided (e.g., the probabilities of transitioning between different states and the magnitude of the rewards available in particular states). However, if reinforcement learning could only be applied to cases with full, explicit knowledge, then it would be a much less compelling model of human learning. In many cases, humans and other animals learn even when there is much more ambiguity, and as a result, a good model of experiential learning for humans should apply to cases where less is known a priori about the world.

In this part of the homework, we will shift to thinking about learning and deciding in an unknown environment. This is a fairly complex topic with many different solutions and problem types. However, we will focus on one particular problem class known as the n-armed bandit. N-armed bandits are optimization problems that mimic many real-world problems faced by humans, organizations, and machine learning agents. The term "bandit" comes from the casino slot machines where you pull a lever to enter a lottery: the machines have one arm (the lever you pull down) and they steal your money (see above).

An n-armed bandit is a problem where a decision maker is presented with a bandit with $n$ arms instead of just one (see the octopus cartoon). The task for the agent is, at each trial or moment in time, to choose arms that are good while avoiding those that are less good. Since nothing may be known about the arms a priori, the problem is difficult and requires a balance of exploration (trying new things in order to learn) and exploitation (choosing options known to be good).

If each bandit arm paid out a fixed amount every time it was selected, then the problem could be solved with a very simple exhaustive search process (visit each arm once and then select the best one for the remaining time). However, the sequential search strategy just described doesn't capture the opportunity cost of exploration. For example, imagine that there are 100 one-armed bandits. Further assume that you know that 98 give zero reward, one gives a reward of 10, and one gives a reward of 20. If on the first pull you receive 10 units of reward, then you were lucky and landed on a good one. However, is it worth searching for the 20-point bandit? Given that you will have to pull a lot of zero-reward arms along the way, it might actually be more rewarding over a finite period to continue pulling the 10-point arm. Thus, exploration and exploitation trade off against each other depending on the structure of the problem.
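A back-of-envelope calculation makes this opportunity cost concrete. The sketch below uses the hypothetical numbers from the example (98 zero-reward arms, one 10-point arm, one 20-point arm) and assumes a fixed horizon of pulls, which is our own addition for illustration:

```python
def expected_total(horizon, search=False):
    """Expected total reward over `horizon` pulls (hypothetical numbers from
    the text: 98 arms pay 0, one pays 10, one pays 20; we found the 10-point
    arm on the first pull)."""
    if not search:
        # Exploit: keep pulling the known 10-point arm for the whole horizon.
        return 10 * horizon
    # Explore: scan the other 99 arms one at a time. The 20-point arm sits at
    # a uniformly random position, so on average ~49.5 pulls earn 0 first.
    expected_search = 99 / 2
    remaining = max(horizon - expected_search, 0)
    return 20 * remaining

for horizon in (80, 500):
    print(horizon, expected_total(horizon), expected_total(horizon, search=True))
```

At a short horizon (80 pulls) exploitation wins, while at a long horizon (500 pulls) the search pays for itself, which is exactly the structure-dependent tradeoff described above.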

In addition, when the reward received from each bandit is probabilistic or stochastic, and when the quality of the bandits might change over time, the problem becomes much more difficult. These cases require the agent to learn from the past while remaining willing to adjust its beliefs based on more recent information.

Bandit tasks come up in many areas of cognitive science and machine learning. For example, A/B testing on websites can be viewed as a particular type of bandit problem (your goal is to maximize conversions or purchases on your website, and your bandit arms are the different web designs you might try out). Similarly, the very real human problem of deciding where to eat lunch is a bit like a bandit problem -- should you return to your favorite restaurant or try a new one? Is the exploration worth giving up a reliably good meal?

In this part of the homework you will explore different simple algorithms for bandit problems.

Starter code

In [58]:
# The typical imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import math

# specific for plotly magic
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
import colorlover as cl
init_notebook_mode(connected=False)
def color_to_rgba(color, alpha):
    return  "rgba(%s,%s,%s,%s)" % (color+(alpha,))

A simple bandit environment

The first class provided here creates a set of simple, stationary multi-arm bandits. The bandits are stateless in that the reward from choosing each action is simply a probabilistic function of the bandit itself, and there are no other cues you can use to decide which action to take. The parameters to the constructor of the bandit environment are:

  • k: the number of bandits
  • mu: the mean of the distribution from which the bandit means are drawn from
  • sigma: the standard deviation of the distribution from which the bandit means are drawn from
  • sd: the standard deviation for all the bandits
In [59]:
class KArmBanditStationary():
    def __init__(self, k, mu=0, sigma=1, sd=10):
        self.k = k
        self.action_means = np.random.normal(mu, sigma, k)
        self.action_sds = sd
        self.optimal = np.argmax(self.action_means)
        
    def step(self, action):
        return np.random.normal(self.action_means[action], self.action_sds), action==self.optimal

class KArmBanditRestless():
    def __init__(self, k, mu=0, sigma=1, sd=10):
        self.k = k
        self.action_means = np.random.normal(mu, sigma, k)
        self.action_sds = sd
        self.optimal = np.argmax(self.action_means)
        
    # The following code is modified by Alfred
    def step(self, action):
        self.action_means += np.random.normal(0, 0.1, self.k)  # arm means drift a little each step
        self.optimal = np.argmax(self.action_means)  # the optimal arm can change as the means drift
        return np.random.normal(self.action_means[action], self.action_sds), action==self.optimal
    

Your job in this first exercise is to write a simple RL agent which samples from these bandits and attempts to earn as much reward as possible. The following cell gives an example of how to initialize the bandit and how to draw from it.

In [60]:
bandit = KArmBanditStationary(10, 0, 5, 10)
action = 0 # pull the 0th bandit
bandit.step(action)  # return values are the reward, and if the action is actually optimal or not
Out[60]:
(5.418389723108475, False)
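Note that with the default `sd=10`, a single pull is a very noisy signal of an arm's quality. The quick sketch below (not part of the assignment; the seed, true mean, and sample size are our own choices for illustration) simulates repeated pulls of one arm and shows that the sample mean recovers the true mean even when any individual pull does not:

```python
import numpy as np

# One arm with true mean 5.0 and reward noise sd=10 (values chosen for
# illustration only). A single pull can land far from 5, but the running
# mean over many pulls converges toward it.
np.random.seed(0)  # for reproducibility; the starter code does not set a seed

true_mean = 5.0
pulls = np.random.normal(true_mean, 10.0, size=2000)  # simulate 2000 pulls
print(round(pulls[0], 2), round(pulls.mean(), 2))  # one pull vs. sample mean
```

This is why a learning agent must average over repeated samples of each arm rather than trusting its first observation.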

This cell implements a simple random agent. You will want to modify this class to create an agent that can learn.

In [61]:
class RandomAgent():
    def __init__(self, k):
        self.num_actions = k
        # you could add parameters to your agent here
        pass
    
    def choose(self):
        return random.randrange(self.num_actions)
    
    def learn(self, reward, action): # this agent doesn't learn
        pass
    
class EpsilonGreedyAgent():
    def __init__(self, k, EPSILON):
        self.num_actions = k
        self.EPSILON = EPSILON # Added by Alfred
        self.band_rwd_count = [0 for i in range(k)] # Added by Alfred
        self.band_rwd_hist = [0 for i in range(k)] # Added by Alfred
        # you could add parameters to your agent here
        pass
    
    # The following code is modified by Alfred
    def choose(self):
        if random.random()<self.EPSILON:
            return random.randrange(self.num_actions)
        else:
            return np.argmax(self.band_rwd_hist)
    
    # The following code is modified by Alfred
    def learn(self, reward, action): # update the running mean reward for this action
        self.band_rwd_hist[action] = (reward + self.band_rwd_count[action]*self.band_rwd_hist[action])/(self.band_rwd_count[action]+1)
        self.band_rwd_count[action]+=1

class EpsilonGreedyDynamicAgent():
    def __init__(self, k, EPSILON, GAMMA):
        self.num_actions = k
        self.EPSILON = EPSILON # Added by Alfred
        self.GAMMA = GAMMA # Added by Alfred
        self.band_rwd_count = [0 for i in range(k)] # Added by Alfred
        self.band_rwd_hist = [0 for i in range(k)] # Added by Alfred
        # you could add parameters to your agent here
        pass
    
    # The following code is modified by Alfred
    def choose(self):
        if random.random()<self.EPSILON:
            return random.randrange(self.num_actions)
        else:
            return np.argmax(self.band_rwd_hist)

    # The following code is modified by Alfred
    def learn(self, reward, action): # update a discounted running average; GAMMA down-weights older rewards
        self.band_rwd_hist[action] = (reward + self.GAMMA*self.band_rwd_count[action]*self.band_rwd_hist[action])/(self.band_rwd_count[action]+1)
        self.band_rwd_count[action]+=1

class UCB1Agent():
    def __init__(self, k):
        self.num_actions = k
        self.band_rwd_count = [0 for i in range(k)] # Added by Alfred
        self.band_rwd_hist = [0 for i in range(k)] # Added by Alfred
        self.band_rwd = [0 for i in range(k)] # Added by Alfred
        # you could add parameters to your agent here
        pass
    
    # The following code is modified by Alfred
    def choose(self, n):
        if n>self.num_actions:
            for i in range(self.num_actions):
                if self.band_rwd_count[i]!=0:
                    self.band_rwd[i] = self.band_rwd_hist[i] + np.sqrt(2*np.log(n)/self.band_rwd_count[i])

            return np.argmax(self.band_rwd)
        else:
            # sample randomly until every arm has had a chance to be tried
            return random.randrange(self.num_actions)

    # The following code is modified by Alfred
    def learn(self, reward, action): # update the running mean reward for this action
        self.band_rwd_hist[action] = (reward + self.band_rwd_count[action]*self.band_rwd_hist[action])/(self.band_rwd_count[action]+1)
        self.band_rwd_count[action]+=1

This cell helps you plot the reward history, including a smoothed average of the reward earned by the agent over the last 30 trials.

In [62]:
def plot_reward_history(reward_history, y_axis_title):
    df=pd.DataFrame(reward_history,columns=["reward"])
    df_smooth = df.rolling(30).mean()
    df_smooth = np.round(df_smooth,3)

    color1=cl.to_numeric(cl.scales['7']['qual']['Paired'])[0]
    color2=cl.to_numeric(cl.scales['7']['qual']['Paired'])[2]
    trace0 = go.Scatter(
        x = df.index,
        y = df.reward.astype(float),
        mode = 'lines',
        name = 'reward',
        line=go.scatter.Line(color=color_to_rgba(color1, 0.3))
    )

    trace1 = go.Scatter(
        x = df_smooth.index,
        y = df_smooth.reward.astype(float),
        mode = 'lines',
        name = 'rolling average',
        line=go.scatter.Line(color=color_to_rgba(color2, 0.8))
    )


    layout = go.Layout(
        title = "Performance",
        xaxis = dict(title="Episode"),
        yaxis = dict(title=y_axis_title) 
    )

    data = [trace0, trace1]
    fig = go.Figure(data = data, layout = layout)

    iplot(fig, filename='layout')

Finally, this is an example of the random agent performance in the environment.

In [63]:
timesteps = 1000
n_bandits = 10

agent = RandomAgent(n_bandits)
#agent = EpsilonGreedyAgent(n_bandits, 0.5) # Added by Alfred
#agent = EpsilonGreedyDynamicAgent(n_bandits, 0.6, 0.2) # Added by Alfred
#agent = UCB1Agent(n_bandits) # Added by Alfred
bandit =  KArmBanditStationary(n_bandits, mu=0, sigma=5, sd=10)
#bandit =  KArmBanditRestless(n_bandits, mu=0, sigma=5, sd=10) # Added by Alfred

reward_history = []
opt_history = []
choice_history = [] # Added by Alfred

for i in range(timesteps):
    choice = agent.choose()
    #choice = agent.choose(i+1) # Added by Alfred
    reward, opt = bandit.step(choice)
    agent.learn(reward,choice) # Added by Alfred for the class EpsilonGreedyAgent()
    reward_history.append(reward)
    opt_history.append(opt)
    choice_history.append(choice) # Added by Alfred
    
plot_reward_history(reward_history, "Reward")
plot_reward_history(opt_history, "P(choose==optimal)")
In [65]:
timesteps = 1000
n_bandits = 10

for j in range(5):
    agent = EpsilonGreedyAgent(n_bandits, 0.2*j+0.1) # Added by Alfred
    bandit =  KArmBanditStationary(n_bandits, mu=0, sigma=5, sd=10)
    reward_history = []
    opt_history = []
    choice_history = [] # Added by Alfred
    for i in range(timesteps):
        choice = agent.choose()
        reward, opt = bandit.step(choice)
        agent.learn(reward,choice) # Added by Alfred for the class EpsilonGreedyAgent()
        reward_history.append(reward)
        opt_history.append(opt)
        choice_history.append(choice) # Added by Alfred
    plot_reward_history(reward_history, "Reward (Epsilon = " + str(0.2*j+0.1) + ")")
    plot_reward_history(opt_history, "P(choose==optimal) (Epsilon = " + str(0.2*j+0.1) + ")")

Experimenting with `EpsilonGreedyAgent`, we see the exploration/exploitation tradeoff directly in the choice of $\epsilon$. For values of $\epsilon$ close to $1.0$, the agent behaves much like `RandomAgent`: it almost always explores, so even once it has identified the optimal action it rarely exploits that knowledge, and the probability of choosing the optimal action stays low. For very small values of $\epsilon$, the agent exploits almost immediately, but with so little exploration its reward estimates are noisy and it can lock onto a suboptimal arm and never discover the optimal action. Intermediate values of $\epsilon$ give the highest probability of choosing the optimal action, since the agent samples each arm enough to estimate its mean while still exploiting the best one most of the time.

In [72]:
timesteps = 1000
n_bandits = 10

for j in range(5):
    if(j==0):
        agent = EpsilonGreedyAgent(n_bandits, 0.5) # Added by Alfred
    else:
        agent = EpsilonGreedyDynamicAgent(n_bandits, 0.6, 0.1*j) # Added by Alfred
    bandit =  KArmBanditRestless(n_bandits, mu=0, sigma=5, sd=10) # Added by Alfred
    reward_history = []
    opt_history = []
    choice_history = [] # Added by Alfred
    for i in range(timesteps):
        choice = agent.choose()
        reward, opt = bandit.step(choice)
        agent.learn(reward,choice) # Added by Alfred for the class EpsilonGreedyAgent()
        reward_history.append(reward)
        opt_history.append(opt)
        choice_history.append(choice) # Added by Alfred
    if(j==0):
        plot_reward_history(reward_history, "Reward (Non-dynamic greedy agent)")
        plot_reward_history(opt_history, "P(choose==optimal)")
    else:
        plot_reward_history(reward_history, "Reward (gamma = " + str(0.1*j) + ")")
        plot_reward_history(opt_history, "P(choose==optimal) (gamma = " + str(0.1*j) + ")")

In `KArmBanditRestless`, small Gaussian noise is added to each arm's mean reward on every step, so this modified mean becomes the arm's new mean. Since the mean rewards drift a little at each step, the new `EpsilonGreedyDynamicAgent` gives less weight to (i.e., discounts) the rewards obtained in earlier steps. This ensures that recent rewards count more than older ones when choosing the optimal action. Also, for higher values of $\gamma$ there is more uncertainty in determining the optimal action, whereas it becomes easier for lower values of $\gamma$.
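A standard alternative way to discount old rewards in nonstationary bandits (a sketch of a general technique, not the agent implemented above) is a constant step size: `Q <- Q + alpha * (r - Q)`, which makes the estimate an exponential recency-weighted average of past rewards:

```python
# Sketch of a constant-step-size estimate for a drifting arm. With step size
# alpha, a reward observed k steps ago is weighted by alpha * (1-alpha)**k,
# so stale rewards decay geometrically instead of counting equally forever.
def recency_weighted_estimate(rewards, alpha=0.1):
    q = 0.0
    for r in rewards:
        q += alpha * (r - q)
    return q

# If an arm's true mean jumps from 0 to 10 halfway through, this estimate
# tracks the change, while a plain running mean would be stuck near 5,
# dragged down by the stale early rewards.
old, new = [0.0] * 50, [10.0] * 50
print(recency_weighted_estimate(old + new))  # close to 10
```

The `GAMMA` in `EpsilonGreedyDynamicAgent` plays a similar recency-weighting role inside the running-average update.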

In [77]:
timesteps = 1000
n_bandits = 10

for j in range(8):
    if(j==0):
        agent = UCB1Agent(n_bandits) # Added by Alfred
        bandit =  KArmBanditStationary(n_bandits, mu=0, sigma=1, sd=10)
    elif(j==1):
        agent = EpsilonGreedyAgent(n_bandits, 0.5) # Added by Alfred
        bandit =  KArmBanditStationary(n_bandits, mu=0, sigma=1, sd=10)
    elif(j==2):
        agent = UCB1Agent(n_bandits) # Added by Alfred
        bandit =  KArmBanditRestless(n_bandits, mu=0, sigma=1, sd=10)
    elif(j==3):
        agent = EpsilonGreedyAgent(n_bandits, 0.5) # Added by Alfred
        bandit =  KArmBanditRestless(n_bandits, mu=0, sigma=1, sd=10)
    elif(j==4):
        agent = UCB1Agent(n_bandits) # Added by Alfred
        bandit =  KArmBanditStationary(n_bandits, mu=0, sigma=5, sd=10)
    elif(j==5):
        agent = EpsilonGreedyAgent(n_bandits, 0.5) # Added by Alfred
        bandit =  KArmBanditStationary(n_bandits, mu=0, sigma=5, sd=10)
    elif(j==6):
        agent = UCB1Agent(n_bandits) # Added by Alfred
        bandit =  KArmBanditRestless(n_bandits, mu=0, sigma=5, sd=10)
    else:
        agent = EpsilonGreedyAgent(n_bandits, 0.5) # Added by Alfred
        bandit =  KArmBanditRestless(n_bandits, mu=0, sigma=5, sd=10)
    reward_history = []
    opt_history = []
    choice_history = [] # Added by Alfred
    for i in range(timesteps):
        if(j%2==0):
            choice = agent.choose(i+1)
        else:
            choice = agent.choose()
        reward, opt = bandit.step(choice)
        agent.learn(reward,choice) # Added by Alfred for the class EpsilonGreedyAgent()
        reward_history.append(reward)
        opt_history.append(opt)
        choice_history.append(choice) # Added by Alfred
    if(j==0):
        plot_reward_history(reward_history, "Reward (UCB1Agent & stnry bandit (sigma=1))")
        plot_reward_history(opt_history, "P(choose==optimal) (UCB1Agent & stnry bandit (sigma=1))")
    elif(j==1):
        plot_reward_history(reward_history, "Reward (EpsilonGreedyAgent & stnry bandit (sigma=1))")
        plot_reward_history(opt_history, "P(choose==optimal) (EpsilonGreedyAgent & stnry bandit (sigma=1))")
    elif(j==2):
        plot_reward_history(reward_history, "Reward (UCB1Agent & restless bandit (sigma=1))")
        plot_reward_history(opt_history, "P(choose==optimal) (UCB1Agent & restless bandit (sigma=1))")
    elif(j==3):
        plot_reward_history(reward_history, "Reward (EpsilonGreedyAgent & restless bandit (sigma=1))")
        plot_reward_history(opt_history, "P(choose==optimal) (EpsilonGreedyAgent & restless bandit (sigma=1))")
    elif(j==4):
        plot_reward_history(reward_history, "Reward (UCB1Agent & stnry bandit (sigma=5))")
        plot_reward_history(opt_history, "P(choose==optimal) (UCB1Agent & stnry bandit (sigma=5))")
    elif(j==5):
        plot_reward_history(reward_history, "Reward (EpsilonGreedyAgent & stnry bandit (sigma=5))")
        plot_reward_history(opt_history, "P(choose==optimal) (EpsilonGreedyAgent & stnry bandit (sigma=5))")
    elif(j==6):
        plot_reward_history(reward_history, "Reward (UCB1Agent & restless bandit (sigma=5))")
        plot_reward_history(opt_history, "P(choose==optimal) (UCB1Agent & restless bandit (sigma=5))")
    elif(j==7):
        plot_reward_history(reward_history, "Reward (EpsilonGreedyAgent & restless bandit (sigma=5))")
        plot_reward_history(opt_history, "P(choose==optimal) (EpsilonGreedyAgent & restless bandit (sigma=5))")

With `UCB1Agent`, when the standard deviation of the rewards is very large compared to the spread of the arm means (here $10 \gg 1$, i.e., $sd \gg sigma$), it becomes extremely difficult for the agent to determine the optimal action among arms with very similar mean rewards. In fact, it can keep repeating an action that need not be the optimal one for the rest of the period. But when the reward standard deviation is closer in magnitude to the spread of the arm means, `UCB1Agent` identifies the optimal action much more quickly than the random or epsilon-greedy agents mentioned previously, and keeps repeating this optimal action for the rest of the period.

In [78]:
# import the gridworld library
import numpy as np
import random
import math
import statistics
from copy import deepcopy
from IPython.display import display, Markdown, Latex, HTML
from gridworld import GridWorld, random_policy
In [79]:
gridworld = [
       [ 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'x', 'g'],
       [ 'o', 'o', 'x', 'o', 'o', 'o', 'o', 'x', 'o'],
       [ 'o', 'o', 'x', 'o', 'o', 'o', 'o', 'x', 'o'],
       [ 'o', 'o', 'x', 'o', 'o', 'o', 'o', 'o', 'o'],
       [ 'o', 'o', 'o', 'o', 'o', 'x', 'o', 'o', 'o'],
       [ 's', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o']
    ] # the problem described above, 'x' is a wall, 's' is start, 'g' is goal, and 'o' is a normal room

mygrid = GridWorld(gridworld)
mygrid.raw_print()  # print out the grid world
mygrid.index_print() # print out the indicies of each state
mygrid.coord_print() # print out the coordinates of each state (helpful in your code)

# define the rewards as a hash table
rewards={}

# mygrid.transitions contains all the pairwise state-state transitions allowed in the grid
# for each state transition intialize the reward to zero
for start_state in mygrid.transitions:
    for action in mygrid.transitions[start_state].keys():
        next_state = mygrid.transitions[start_state][action]
        rewards[str([start_state, action, next_state])] = 0.0

# now set the reward for moving up into state 8 (the goal state) to +10
rewards[str([17, 'up', 8])] = 10

# now set the penalty for walking off the edge of the grid and returning to state 45 (the start state)
for i in [0,1,2,3,4,5,6,7]:
    rewards[str([i, 'up', 45])] = -1
for i in [0,9,18,27,36,45]:
    rewards[str([i, 'left', 45])] = -1
for i in [45,46,47,48,49,50,51,52,53]:
    rewards[str([i, 'down', 45])] = -1
for i in [8,17,26,35,44,53]:
    rewards[str([i, 'right', 45])] = -1

Welcome to your new Grid World!

Raw World Layout

oooooooxg
ooxooooxo
ooxooooxo
ooxoooooo
oooooxooo
soooooooo

Indexes of each grid location as an id number

 0  1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44
45 46 47 48 49 50 51 52 53

Indexes of each grid location as a tuple

(0,0) (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8)
(1,0) (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (1,7) (1,8)
(2,0) (2,1) (2,2) (2,3) (2,4) (2,5) (2,6) (2,7) (2,8)
(3,0) (3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (3,7) (3,8)
(4,0) (4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (4,7) (4,8)
(5,0) (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) (5,7) (5,8)
In [80]:
# The following code is added by Alfred
def be_greedy(q_values):
    if len(q_values)==0:
        return {}
    
    keys = list(q_values.keys())
    vals = [q_values[i] for i in keys]    
    maxqs = [i for i,x in enumerate(vals) if x==max(vals)]
    if len(maxqs)>1:
        pos = random.choice(maxqs)
    else:
        pos = maxqs[0]
    policy = deepcopy(q_values)
    for i in policy.keys():
        policy[i]=0.0
    policy[keys[pos]]=1.0
    return policy

# The following code is added by Alfred
def epsilon_greedy(q_values, EPSILON):
    if len(q_values)==0:
        return {}
    if random.random() < EPSILON:
        # explore: pick a random action
        return random.choice(list(q_values.keys()))
    else:
        # exploit: pick the highest-valued action, breaking ties at random
        keys = list(q_values.keys())
        vals = [q_values[i] for i in keys]
        maxqs = [i for i,x in enumerate(vals) if x==max(vals)]
        if len(maxqs)>1:
            pos = random.choice(maxqs)
        else:
            pos = maxqs[0]
        return keys[pos]
    
# The following code is added and modified by Alfred
# Recursively sample state-action transitions using epsilon-greedy action selection,
# applying a Q-learning update at each step, up to a maximum recursion depth of 100.
def mc_episode(current_state, EPSILON, GAMMA, ALPHA, goal_state, policy_table, q_value_table, depth=0, max_depth=100):
    if current_state!=goal_state and depth<max_depth:
        sx, sy = mygrid.index_to_coord(current_state)
        action = epsilon_greedy(q_value_table[sx][sy],EPSILON)
        if action == 'up':
            new_state = mygrid.up(current_state)
        elif action == 'right':
            new_state = mygrid.right(current_state)
        elif action == 'down':
            new_state = mygrid.down(current_state)
        elif action == 'left':
            new_state = mygrid.left(current_state)
        r = rewards[str([current_state,action,new_state])]
        
        #$Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$
        sx1, sy1 = mygrid.index_to_coord(new_state)
        action1 = epsilon_greedy(q_value_table[sx1][sy1],0.0) # Added by Alfred
        q_value_table[sx][sy][action] = q_value_table[sx][sy][action] + ALPHA*(r + GAMMA*q_value_table[sx1][sy1][action1] - q_value_table[sx][sy][action])
        q_value_table = mc_episode(new_state, EPSILON, GAMMA, ALPHA, goal_state, policy_table, q_value_table, depth+1)

    return q_value_table
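The update applied at each step of `mc_episode` is the Q-learning rule, $Q(s,a) \leftarrow Q(s,a) + \alpha [r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. A one-step numeric check of this rule (the starting Q-value is made up; the learning rate and discount match the $\alpha = 0.2$, $\gamma = 0.9$ used in the next cell):

```python
# One-step check of the Q-learning update used in mc_episode, for a move
# "up" into the goal state, which pays +10 and has no future value.
ALPHA, GAMMA = 0.2, 0.9
q_sa = 1.0        # current estimate for (s, a); hypothetical starting value
r = 10.0          # reward for entering the goal state
max_q_next = 0.0  # goal state: episode ends, so no future value

q_sa = q_sa + ALPHA * (r + GAMMA * max_q_next - q_sa)
print(q_sa)  # 1.0 + 0.2 * (10.0 + 0.0 - 1.0) = 2.8
```

Repeating this update over many episodes propagates the goal reward backward through the grid, which is what the loop in the next cell does 100,000 times.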
In [81]:
starting_state = 45
goal_state = 8 # terminate the MC roll out when you get to this state
GAMMA=0.9
EPSILON = 0.2
ALPHA = 0.2 # Added by Alfred


# set up initial data structures that might be useful for you
# q(s,a) - the q-values for each action in each state
def zero_q_values():
    qvals = {"up": 0.0, "right": 0.0, "down": 0.0, "left": 0.0}
    return qvals
q_value_table = [[zero_q_values() for i in range(mygrid.ncols)] for j in range(mygrid.nrows)]

# pi - the policy table
policy_table = [[random_policy() for i in range(mygrid.ncols)] for j in range(mygrid.nrows)]
display(Markdown("**Initial (randomized) policy**"))
mygrid.pretty_print_policy_table(policy_table)


# The following code is added and modified by Alfred
for i in range(100000):  # you probably need to take many, many steps here and it may take some time to run
    q_value_table = mc_episode(starting_state, EPSILON, GAMMA, ALPHA, goal_state, policy_table, q_value_table)

# The following code is modified by Alfred
# improve policy
for sx in range(len(q_value_table)):
    for sy in range(len(q_value_table[0])):
        policy_table[sx][sy] = be_greedy(q_value_table[sx][sy])
        
display(Markdown("**Improved policy**"))
mygrid.pretty_print_policy_table(policy_table)

Initial (randomized) policy

Improved policy

Turning in homework

When you are finished with this notebook, save your work in order to turn it in. To do this select File->Download As...->HTML.

You can turn in your assignments using NYU Classes webpage for the course (available on https://home.nyu.edu). Make sure you complete all parts (A and B) of this homework.